
For the Ozone data from the R package mlbench, try the following machine learning prediction approach: read the paper "Feature Selection with the Boruta Package" and implement the algorithm. Build a prediction model for the Ozone variable. Which features are most important?
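Before the implementation below, it helps to recall the core idea of Boruta: append a shuffled "shadow" copy of every feature, fit a random forest, and keep only the real features whose importance beats the best shadow. A minimal sketch of the shadow-feature step on toy data (the function name `add_shadow_features` is my own, not from the boruta package):

```python
import numpy as np

def add_shadow_features(X, rng):
    """Append a permuted ('shadow') copy of every column of X.

    Shadow columns keep each feature's marginal distribution but break
    any relationship with the target, so their importances form a
    baseline that real features must exceed.
    """
    shadows = np.apply_along_axis(rng.permutation, 0, X)
    return np.hstack([X, shadows])

rng = np.random.default_rng(0)
X = np.arange(10.0).reshape(5, 2)   # toy matrix: 5 samples, 2 features
X_ext = add_shadow_features(X, rng)
print(X_ext.shape)                   # (5, 4): two real columns plus two shadow columns
```

BorutaPy automates this loop internally (re-shuffling the shadows each iteration and running a statistical test), which is why only the estimator and `max_iter` need to be configured below.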
NAME: SAJITH GOWTHAMAN NET ID: ek5282
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
df = pd.read_csv("Ozone.csv")
df.head()
df.isna().sum()
df.mean()
df.fillna(df.mean(numeric_only=True), inplace=True)  # impute missing values with column means
Let's examine the target variable to decide whether a classification or a regression model is appropriate.
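As a quick rule of thumb (sketched on toy values here, not the real V4 column): if the target takes only a handful of discrete values, treating it as classes is plausible; a continuous spread points to regression.

```python
import pandas as pd

# toy stand-in for a target column (not the actual Ozone V4 values)
target = pd.Series([1.0, 3.0, 1.0, 5.0, 3.0, 5.0])
n_unique = target.nunique()
print(n_unique)        # 3 distinct values
print(n_unique <= 10)  # True -> few distinct classes, a classifier is reasonable
```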
df.info()
df.describe()
sns.pairplot(df)
df['V4'] = df['V4'].round().astype(int)  # round and cast so the target holds integer class labels
df.V4.unique()
## Let's load X and Y
X=df.loc[:, df.columns != 'V4'].values
Y = df['V4'].values.ravel()  # select the target by name; iloc[:, 4] would grab the fifth column, not V4
from sklearn.ensemble import RandomForestClassifier
from boruta import BorutaPy
rf = RandomForestClassifier(n_jobs=-1, class_weight=None, max_depth=2)
# define Boruta feature selection method
feat_selector = BorutaPy(rf, n_estimators='auto', verbose=2, max_iter = 30)
# find all relevant features
feat_selector.fit(X, Y)
# check ranking of features
feat_selector.ranking_
feat_selector.support_
print('==============BORUTA==============')
print(feat_selector.n_features_)
features = [f for f in df.columns if f not in ['V4']]
len(features)
important_features = [features[i] for i in np.where(feat_selector.support_)[0]]
print('The most important features are: {}'.format(important_features))
unimportant_features = [features[i] for i in np.where(~feat_selector.support_)[0]]
print('The unimportant features are: {}'.format(unimportant_features))
imp_df = pd.DataFrame(df[important_features])
imp_df
#call transform() on X to filter it down to selected features
X_filtered = feat_selector.transform(X)
X_filtered.shape
X_filtered_df = pd.DataFrame(X_filtered)
X_filtered_df
Let's check how Boruta's selection has changed the data by plotting the selected variables with hvPlot.
import hvplot.pandas
import holoviews as hv
X_filtered_df.columns = X_filtered_df.columns.astype(str)  # hvplot lookups below use string column names
(X_filtered_df.hvplot(kind='scatter', x='0', y='1', by='0')
 + X_filtered_df.hvplot(kind='scatter', x='0', y='1', by='1'))
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn import ensemble
X = imp_df
Y = df.V4
# hold out a test split so the model is evaluated on unseen data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=42)
df_rf_clf = ensemble.RandomForestClassifier(n_estimators=23)
df_rf_clf.fit(X_train, Y_train)
predictions = df_rf_clf.predict(X_test)
print(classification_report(Y_test, predictions))
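A single train/test split can still be noisy; cross-validation gives a steadier estimate of how the selected features generalize. A sketch on synthetic data (`make_classification` stands in for `imp_df` and `df.V4`, which are not reproduced here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# synthetic stand-in for the selected Ozone features and discretized target
X_toy, y_toy = make_classification(n_samples=120, n_features=6, random_state=0)
clf = RandomForestClassifier(n_estimators=23, random_state=0)
scores = cross_val_score(clf, X_toy, y_toy, cv=5)  # 5-fold accuracy scores
print(len(scores))  # 5, one score per fold
```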